Machine Learning Fundamentals: Company Segmentation

Author

Gabriel Storch

Published

June 16, 2024

Challenge Summary

Your organization wants to know which companies are similar to each other to help in identifying potential customers of a SAAS software solution (e.g. Salesforce CRM or equivalent) in various segments of the market. The Sales Department is very interested in this analysis, which will help them more easily penetrate various market segments.

You will be using stock prices in this analysis. You come up with a method to classify companies based on how their stocks trade using their daily stock returns (percentage movement from one day to the next). This analysis will help your organization determine which companies are related to each other (competitors and have similar attributes).

You can analyze the stock prices using what you’ve learned in the unsupervised learning tools including K-Means and UMAP. You will use a combination of kmeans() to find groups and umap() to visualize similarity of daily stock returns.

Objectives

Apply your knowledge on K-Means and UMAP along with dplyr, ggplot2, and purrr to create a visualization that identifies subgroups in the S&P 500 Index. You will specifically apply:

Modeling: kmeans() and umap()

Iteration: purrr

Data Manipulation: dplyr, tidyr, and tibble

Visualization: ggplot2 (bonus plotly)

Libraries

suppressWarnings(
  {
    library(tidyverse)
    library(tidyquant)
    library(broom)
    library(umap)
    
  }
)

Data

We will be using stock prices in this analysis. Although some of you know already how to use an API to retrieve stock prices I obtained the stock prices for every stock in the S&P 500 index for you already. The files are saved in the session_6_data directory.

We can read in the stock prices. The data is 1.2M observations. The most important columns for our analysis are:

symbol: The stock ticker symbol that corresponds to a company’s stock price

date: The timestamp relating the symbol to the share price at that point in time

adjusted: The stock price, adjusted for any splits and dividends (we use this when analyzing stock data over long periods of time)

# STOCK PRICES
sp_500_prices_tbl <- readRDS("C:/Projekte/bdml/sp_500_prices_tbl.rds")
sp_500_prices_tbl

The second data frame contains information about the stocks the most important of which are:

company: The company name

sector: The sector that the company belongs to

# SECTOR INFORMATION
sp_500_index_tbl <- read_rds("C:/Projekte/bdml/sp_500_index_tbl.rds")
sp_500_index_tbl

Question

Which stock prices behave similarly?

Answering this question helps us understand which companies are related, and we can use clustering to help us answer it!

Even if you’re not interested in finance, this is still a great analysis because it will tell you which companies are competitors and which are likely in the same space (often called sectors) and can be categorized together. Bottom line - This analysis can help you better understand the dynamics of the market and competition, which is useful for all types of analyses from finance to sales to marketing.

Let’s get started.

Step 1 - Convert stock prices to a standardized format (daily returns)

What you first need to do is get the data in a format that can be converted to a “user-item” style matrix. The challenge here is to connect the dots between what we have and what we need to do to format it properly.

We know that in order to compare the data, it needs to be standardized or normalized. Why? Because we cannot compare values (stock prices) that are of completely different magnitudes. In order to standardize, we will convert from adjusted stock price (dollar value) to daily returns (percent change from previous day). Here is the formula.

\[ return_{daily} = \frac{price_{i}-price_{i-1}}{price_{i-1}} \]

First, what do we have? We have stock prices for every stock in the SP 500 Index, which is the daily stock prices for over 500 stocks. The data set is over 1.2M observations.

sp_500_prices_tbl |> glimpse()
#> Rows: 1,225,765
#> Columns: 8
#> $ symbol   <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT…
#> $ date     <date> 2009-01-02, 2009-01-05, 2009-01-06, 2009-01-07, 2009-01-08, …
#> $ open     <dbl> 19.53, 20.20, 20.75, 20.19, 19.63, 20.17, 19.71, 19.52, 19.53…
#> $ high     <dbl> 20.40, 20.67, 21.00, 20.29, 20.19, 20.30, 19.79, 19.99, 19.68…
#> $ low      <dbl> 19.37, 20.06, 20.61, 19.48, 19.55, 19.41, 19.30, 19.52, 19.01…
#> $ close    <dbl> 20.33, 20.52, 20.76, 19.51, 20.12, 19.52, 19.47, 19.82, 19.09…
#> $ volume   <dbl> 50084000, 61475200, 58083400, 72709900, 70255400, 49815300, 5…
#> $ adjusted <dbl> 15.86624, 16.01451, 16.20183, 15.22628, 15.70234, 15.23408, 1…

Your first task is to convert to a tibble named sp_500_daily_returns_tbl by performing the following operations:

sp_500_daily_returns_tbl <- sp_500_prices_tbl %>% 
  # Select the symbol, date and adjusted columns
  select(symbol, date, adjusted) %>%
  # Filter to dates beginning in the year 2018 and beyond
  filter(date >= "2018-01-01") %>%
  # Group by symbol
  group_by(symbol) %>%
  # Compute 1 day lag
  mutate(lag_1_d = lag(adjusted)) %>%
  # Remove NA values
  filter(!is.na(lag_1_d)) %>%
  # Compute difference between adj and lag
  mutate(diff = adjusted - lag_1_d) %>%
  # Compute the percentage difference
  mutate(pct_return = diff / lag_1_d) %>% 
  # Ungroup
  ungroup() %>%
  # Select only relevant columns
  select(symbol, date, pct_return) 
  # Output: sp_500_daily_returns_tbl
  sp_500_daily_returns_tbl

Step 2 - Convert to User-Item Format

The next step is to convert to a user-item format with the symbol in the first column and every other column the value of the daily returns (pct_return) for every stock at each date.

We’re going to import the correct results first (just in case you were not able to complete the last step).

o_sp_500_daily_returns_tbl <- read_rds("C:/Projekte/bdml/sp_500_daily_returns_tbl.rds") 
# I was able to correctly complete the last step:
all(sp_500_daily_returns_tbl == o_sp_500_daily_returns_tbl)
#> [1] TRUE

Now that we have the daily returns (percentage change from one day to the next), we can convert to a user-item format. The user in this case is the symbol (company), and the item in this case is the pct_return at each date.

Spread the date column to get the values as percentage returns. Make sure to fill an NA values with zeros.

Save the result as stock_date_matrix_tbl

stock_date_matrix_tbl <- sp_500_daily_returns_tbl %>%
  # Select the symbol, date and pct_return columns
  select(symbol, date, pct_return) %>%
  # Spread the date column to get the values as percentage returns
  spread(key = date, value = pct_return) %>%
  # Fill NA values with zeros
  replace(is.na(.), 0)
# Output: stock_date_matrix_tbl
stock_date_matrix_tbl

Step 3 - Perform K-Means Clustering

Next, we’ll perform K-Means clustering.

We’re going to import the correct results first (just in case you were not able to complete the last step).

stock_date_matrix_tbl1 <- read_rds("C:/Projekte/bdml/stock_date_matrix_tbl.rds")
# Assert we got it all right
all(stock_date_matrix_tbl1 == stock_date_matrix_tbl)
#> [1] TRUE

Beginning with the stock_date_matrix_tbl, perform the following operations:

 # Create kmeans_obj for 4 centers
kmeans_obj <- stock_date_matrix_tbl %>% 
  # Drop nun-numeric col
  select(-symbol) %>% 
  # Perform Kmeans
  kmeans(centers=4, nstart=20)

Use glance() to get the tot.withinss.

# Apply glance() to get the tot.withinss
withinss <- glance(kmeans_obj)[["tot.withinss"]]
withinss
#> [1] 29.20555

Step 4 - Find the optimal value of K

Now that we are familiar with the process for calculating kmeans(), let’s use purrr to iterate over many values of “k” using the centers argument.

We’ll use this custom function called kmeans_mapper():

kmeans_mapper <- function(center = 3) { 
  stock_date_matrix_tbl %>% 
    select(-symbol) %>% 
    kmeans(centers = center, nstart = 20) 
  }

Apply the kmeans_mapper() and glance() functions iteratively using purrr.

# Use purrr to map
k_means_mapped_tbl <- tibble(centers=1:30) %>%
  mutate(k_means=map(centers, kmeans_mapper)) %>%
  mutate(glance = map(k_means, glance))
# Output: k_means_mapped_tbl
k_means_mapped_tbl

Next, let’s visualize the “tot.withinss” from the glance output as a Scree Plot.

Begin with the k_means_mapped_tbl

# Visualize Scree Plot
plt_tbl <- k_means_mapped_tbl %>% unnest(glance)

ggplot(plt_tbl, aes(x = centers, y = tot.withinss)) +
  geom_point() +  
  geom_line() +
  labs(x = "Centers", y = "Total Within SS") +  
  ggtitle("Scree Plot") 

We can see that the Scree Plot becomes linear (constant rate of change) between 5 and 10 centers for K.

Step 5 - Apply UMAP

Next, let’s plot the UMAP 2D visualization to help us investigate cluster assignments.

We’re going to import the correct results first (just in case you were not able to complete the last step).

k_means_mapped_tbl <- read_rds("C:/Projekte/bdml/k_means_mapped_tbl.rds")

First, let’s apply the umap() function to the stock_date_matrix_tbl, which contains our user-item matrix in tibble format.

# Apply UMAP, Store results as: umap_results
umap_results <- stock_date_matrix_tbl %>%
  select(-symbol) %>%
  umap()

Next, we want to combine the layout from the umap_results with the symbol column from the stock_date_matrix_tbl.

# Convert umap results to tibble with symbols
umap_results_tbl <- umap_results$layout %>% 
  as_tibble() %>% 
  bind_cols(stock_date_matrix_tbl$symbol) %>%
  rename(symbol = ...3)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
#> New names:
#> • `` -> `...3`
# Output: umap_results_tbl
umap_results_tbl

Finally, let’s make a quick visualization of the umap_results_tbl.

 # Visualize UMAP results
umap_results_tbl %>%
  # x = V1, y = V2
  ggplot(aes(x = V1, y = V2)) +
  # transparent points
  geom_point(alpha = 0.5) + 
  theme_tq() +
  labs(title = "UMAP Projection") 

We can now see that we have some clusters. However, we still need to combine the K-Means clusters and the UMAP 2D representation.

Step 6 - Combine K-Means and UMAP

Next, we combine the K-Means clusters and the UMAP 2D representation

k_means_mapped_tbl <- read_rds("C:/Projekte/bdml/k_means_mapped_tbl.rds") 
umap_results_tbl <- read_rds("C:/Projekte/bdml/umap_results_tbl.rds")

First, pull out the K-Means for 10 Centers. Use this since beyond this value the Scree Plot flattens. Have a look at the business case to recall how that works.

# Get the k_means_obj from the 10th center, store as k_means_obj
k_means_obj <- k_means_mapped_tbl[which(k_means_mapped_tbl$centers == 10),]$k_means

Next, we’ll combine the clusters from the k_means_obj with the umap_results_tbl.

# Use your dplyr & broom skills to combine the k_means_obj with the umap_results_tbl

# augment(k_means_obj, data = stock_date_matrix_tbl)
# Error: No augment method for objects of class list
# ~> unfortunately the k means object is not recognized as such by augment, so i do a work-around

umap_kmeans_results_tbl <- k_means_obj[[1]]$cluster %>%
  # merge cluster info w/ stock date matrix
  as.tibble() %>%
  bind_cols(stock_date_matrix_tbl) %>%
  rename(.cluster = value) %>%
  # projecting to cluster and symbol
  select(.cluster, symbol) %>%
  # left join with umap res on symbol
  left_join(umap_results_tbl, by="symbol") %>%
  # enrich with sp_500_index_tbl info
  left_join(sp_500_index_tbl %>% 
              select(symbol, company, sector),
            by = "symbol")
#> Warning: `as.tibble()` was deprecated in tibble 2.0.0.
#> ℹ Please use `as_tibble()` instead.
#> ℹ The signature and semantics have changed, see `?as_tibble`.
# Output: umap_kmeans_results_tbl
umap_kmeans_results_tbl

Plot the K-Means and UMAP results.

# Visualize the combined K-Means and UMAP results
umap_kmeans_results_tbl %>%  
  # x = V1, y = V2
  ggplot(aes(x = V1, y = V2, color = .cluster)) +
  # transparent points
  geom_point(alpha = 0.5) + 
  theme_tq() +
  labs(title = "UMAP Projection") 

Congratulations! You are done with the 1st challenge!